Beta-Negative Binomial Process and Poisson Factor Analysis 



Mingyuan Zhou Lauren A. Hannah^ David B. Dunson^ Lawrence Carin 

Department of ECE, ^Department of Statistical Science, Duke University, Durham NC 27708, USA 



Abstract 

A beta-negative binomial (BNB) process is 
proposed, leading to a beta-gamma-Poisson 
process, which may be viewed as a "multi- 
scoop" generalization of the beta-Bernoulli 
process. The BNB process is augmented 
into a beta-gamma-gamma-Poisson hierar- 
chical structure, and applied as a nonpara- 
metric Bayesian prior for an infinite Poisson 
factor analysis model. A finite approximation 
for the beta process Levy random measure is 
constructed for convenient implementation. 
Efficient MCMC computations are performed 
with data augmentation and marginalization 
techniques. Encouraging results are shown 
on document count matrix factorization. 



1 Introduction 

Count data appear in many settings. Problems in- 
clude predicting future demand for medical care based 



on past use (Cameron et al. 1988 Deb and Trivedi 



1997), species sampling (National Audubon Society) 



and topic modeling of document corpora (Blei et al. 



2003). Poisson and negative binomial distributions 



are typical choices for univariate and repeated mea- 
sures count data; however, multivariate extensions in- 
corporating latent variables (latent counts) are under 
developed. Latent variable models under the Gaus- 
sian assumption, such as principal component anal- 
ysis and factor analysis, are widely used to discover 



low-dimensional data structure ( Lawrence 2005 Tip- 



|ping and Bishop) |1999| |West[ |2003| |Zhou et al.[|2009 1 



There has been some work on exponential family la- 
tent factor models that incorporate Gaussian latent 



variables (Dunson 2000 2003 Moustaki and Knott 



2000 Sammel et al. 1997), but computation tends 
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to be prohibitive in high-dimensional settings and the 
Gaussian assumption is restrictive for count data that 
are discrete and nonnegative, have limited ranges, and 
often present overdispersion. In this paper we pro- 
pose a flexible new nonparametric Bayesian prior to 
address these problems, the beta-negative binomial 
(BNB) process. 



Using completely random measures (Kingman 19671, 



Thibaux and Jordan ( 2007 ) ge nerali ze the beta process 
defined on [0, 1] x R + by Hjort ( 1990 1 to a general prod- 
uct space [0, 1] x fi, and define a Bernoulli process on an 
atomic beta process hazard measure to model binary 
outcomes. They further show that the beta-Bernoulli 
process is the underlying de Finetti mixing distribu- 



tion for the Indian buffet process (IBP) of Griffiths 
and Ghahramani (2005). To model count variables, 



we extend the measure space of the beta process to 



[0,1] 



x Q and introduce a negative binomial pro- 



cess, leading to the BNB process. We show that the 
BNB process can be augmented into a beta-gamma- 
Poisson process, and that this process may be inter- 
preted in terms of a "multi-scoop" IBP. Specifically, 
each "customer" visits an infinite set of dishes on a 
buffet line, and rather than simply choosing to select 
certain dishes off the buffet (as in the IBP), the cus- 
tomer may select multiple scoops of each dish, with 
the number of scoops controlled by a negative bino- 
mial distribution with dish-dependent hyperparame- 
ters. As discussed below, the use of a negative bino- 
mial distribution for modeling the number of scoops 
is more general than using a Poisson distribution, as 
one may control both the mean and variance of the 
counts, and allow overdispersion. This representation 
is particularly useful for discrete latent variable mod- 
els, where each latent feature is not simply present or 
absent, but contributes a distinct count to each obser- 
vation. 

We use the BNB process to construct an infinite dis- 
crete latent variable model called Poisson factor anal- 
ysis (PFA), where an observed count is linked to its 
latent parameters with a Poisson distribution. To en- 
hance model flexibility, we place a gamma prior on 
the Poisson rate parameter, leading to a negative bi- 
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nomial distribution. The BNB process is formulated in 
a beta-gamma-gamma-Poisson hierarchical structure, 
with which we construct an infinite PFA model for 
count matrix factorization. We test PFA with various 
priors for document count matrix factorization, mak- 
ing connections to previous models; here a latent count 
assigned to a factor (topic) is the number of times that 
factor appears in the document. 

The contributions of this paper are: 1) an extension 
of the beta process to a marked space, to produce the 
beta-negative binomial (BNB) process; 2) efficient in- 
ference for the BNB process; and 3) a flexible model 
for count matrix factorization, which accurately cap- 
tures topics with diverse characteristics when applied 
to topic modeling of document corpora. 

2 Preliminaries 

2.1 Negative Binomial Distribution 

The Poisson distribution X ~ Pois(A) is commonly 
used for modeling count data. It has the probability 
mass function fx(k) — e x X k /k\, where k € {0, 1, . . . }, 
with both the mean and variance equal to A. A gamma 
distribution with shape r and scale p/(l — p) can be 
placed as a prior on A to produce a negative binomial 
(a.k.a, gamma-Poisson) distribution as 

/>oo 

fx(k) = / Pois(fc; A)Gamma (A; r, p/(l — p)) dX 
Jo 

_ T(r + fc) 



k\T(r) 



(1-pYp 



(1) 



where T(-) denotes the gamma function. Parame- 
terized by r > and p G (0,1), this distribution 
X ~ NB(r, p) has a variance rp/(l — p) 2 larger than 
the mean rp/(l — p), and thus it is usually favored 
for modeling overdispersed count data. More detailed 
discussions about the negative binomial and related 
distributions and the corresponding stochastic pro- 
cesses defined on M. + can be found in Kozubowski and 



Podgorski (2009) and Barndorff-Nielsen et al. (2010) 



2.2 Levy Random Measures 

Beta-negative binomial processes are created using 
Levy random measures. Following Wol pert et al.| 
(2011), for any v + > and any probability distri- 
bution ir(dpdui) on R x O, let K ~ Pois(^ + ) and 

{{Pk,Uk)}\<k<K ~ n(dpduj). Defining Ia(^) as be- 
ing one if LOk € A and zero otherwise, the random 
measure C(A) = X^fc=i ^-A{oJk)Pk assigns independent 
infinitely divisible random variables C(Ai) to disjoint 
Borel sets Ai C Cl, with characteristic functions 



E[< 



exp 



(e 4tp - l)v{dpduj) \ (2) 



xA 



with v{dpduj) = i> + it{dpdu)). A random signed mea- 
sure L satisfying ([2| is called a Levy random measure. 
More generally, if the Levy measure v(dpduj) satisfies 



(1 A \p\)v(dpduj) < oo 



(3) 



xS 



for each compact S C Cl, it need not be finite for the 
Levy random measure C to be well defined; the nota- 
tion 1 A \p\ denotes min{l, \p\}. A nonnegative Levy 
random measure C satisfying ^ was called a com- 



pletely random measure (CRM) by Kingman ( 1967 
1993 ) and an additive random measure by Qinlar 



(2011). It was introduced for machine learning by 



Thibaux and Jordan (2007) and Jordan (2010). 



2.3 Beta Process 



The beta process (BP) was defined by Hjort ( 1990 ) for 
survival analysis with CI = K+. Thibaux and Jordan 



( 2007 ) generalized the process to an arbitrary measur- 



able space Cl by defining a CRM Bona product space 
[0, 1] x Cl with the Levy measure 

v B p{dpduj) = cp-\l - pf^dpBoidu)). (4) 

Here c > is a concentration parameter (or concentra- 
tion function if c is a function of cj), B is a continuous 
finite measure over Cl, called the base measure, and 
a = Bq(CI) is the mass parameter. Since v^p{dpdu) 
integrates to infinity but satisfies ([3]), a countably in- 
finite number of i.i.d. random points {(pk,^k)}k=i.oo 
are obtained from the Poisson process with mean mea- 
sure i/bp and Pk is finite, where the atom uik £ Cl 
and its weight p^ € [0, 1]. Therefore, we can express a 
BP draw, B ~ BP(c, B ), as B = Y^=iPk^ h , where 
5 Uk is a unit measure at the atom Uk- If Bq is dis- 
crete (atomic) and of the form B = ^2 k qk5^ k , then 
B = Y.kPk^k with p k - Beta(cg fe ,c(l - q k )). If B 
is mixed discrete-continuous, B is the sum of the two 
independent contributions. 

3 The Beta Process and the Negative 
Binomial Process 



Let B be a BP draw as defined in Sec. |2.3[ and there- 
fore B = J2kLiPk$uk- ^ Bernoulli process BeP(£>) 
has atoms appearing at the same locations as those of 
B; it assigns atom uj k unit mass with probability p k , 
and zero mass with probability 1 — p^, i.e., Bernoulli. 
Consequently, each draw from BcP(i3) selects a (fi- 
nite) subset of the atoms in B. This construction is 
attractive because the beta distribution is a conjugate 
prior for the Bernoulli distribution. It has diverse ap- 



plications including document classification (Thibaux 



and Jordan 20071, dictionary learning (Zhou et al. 



2012 2009 2011) and topic modeling (Li et al. 2011) 



The beta distribution is also the conjugate prior for 
the negative binomial distribution parameter p, which 
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suggests coupling the beta process with the negative 
binomial process. Further, for modeling flexibility it is 
also desirable to place a prior on the negative-binomial 
parameter r, this motivating a marked beta process. 

3.1 The Beta-Negative Binomial Process 

Recall that a BP draw B ~ BP(c, Bq) can be consid- 
ered as a draw from the Poisson process with mean 
measure ^bp m (gj). We can mark a random point 
(tOk,Pk) of B with a random variable Tk taking values 
in R + , where r k and tv arc independent for k =/= k' . 



Using the marked Poisson processes theorem (King- 
{(pk,r k , Wfc)}fc=i,oo can be regarded as 



man 



1993) 



random points drawn from a Poisson process in the 
product space [0, 1] x R + x fi, with the Levy measure 

v^ P (dpdrdu) = cp-^l-pY^dpRoid^Boiduj) (5) 

where Rq is a continuous finite measure over M + and 
7 = i?o(M + ) is the mass parameter. With |5| and us- 
ing RqBq as the base measure, we construct a marked 
beta process B* ~ BP(c, RqB ), producing 



B* = y^pk8 { 



fe=l 



(6) 



'BP 



where a point (pk,fki^>k) ~ v^ v {dpdrdui)/v^ 
tains an atom (r^, w^) € 1R + xf2 with weight pfe G [0, 1]. 

With B* constituted as in ([6]), we define the ith draw 
from a negative binomial process as X; ~ NBP(J3*), 
with 



E 

fe=i 



KkiS Uk , km ~ NB(r fe ,p fe ). 



(7) 



The BP draw £?* in ([6]) defines a set of parameters 
{(pfe,r fe ,a; fe )} fc= x, 00 , and {r k ,p k ) are used within the 
negative binomial distribution to draw a count for 
atom cjfc. The {(pfc, rjfe, Wfc)}fc=i |0o are shared among all 
draws {X}, and therefore the atoms {cok} are shared; 
the count associated with a given atom u>k is a function 
of index i, represented by Kki- 



In the beta-Bernoulli process (Thibaux and Jordan 



2007), we also yield draws like {Xi} above, except 
that in that case the Kki is replaced by a one or 
zero, drawn from a Bernoulli distribution. The re- 
placement of k^ with a one or zero implies that in 
the beta-Bernoulli process a given atom Uk is either 
used (weight one) or not (weight zero). Since in the 
proposed beta-negative binomial process the Kki cor- 
responds to counts of "dish" ujk, with the number of 
counts drawn from a negative binomial distribution 
with parameters (rk,Pk), we may view the proposed 
model as a generalization to a "multi-scoop" version 
of the beta-Bernoulli process. Rather than simply ran- 
domly selecting dishes {uk} from a buffet, here each 
Xi may draw multiple "scoops" of each Wfc. 



3.2 Model Properties 

Assume we already observe {Xj}j = i, n . Since the beta 
and negative binomial processes are conjugate, the 
conditional posterior of pk at an observed point of dis- 
continuity (rfc,Wfc) is 



Pk ~ Beta(TO„ fc , c + nr k ) 



(8) 



where m nk = Y2=i K ki an d K ki = Xi{uj k }. The pos- 
terior of the Levy measure of the continuous part can 
be expressed as 

u bp n (dpdrduj) — cp^ 1 ^ — p) Cn ~ 1 dpR (dr)B (du!) 

(9) 

where c„ = c + nr is a concentration function. With 
^ and ([9]), the posterior B*\{Xi}i^ n is defined, and 



following the n o tation in |Kim| fll999|); |Miller| ( [2011) ); 
Thibauxl (120081) ; iThibaux and Jordanl (I2007I), it can 



be expressed as 



B*\{Xi} lin - BP [ c n , —R B + m nkS(r k , Uh ) ) 

\ Cn Cn k ) 



(10) 



where 



I c + m nk +nr k , if (r, u) = (r k , Wfc) G V 
c,„ = < 11 
\c + nr, if (r,u) e (R+ X Q)\V 



where V = {(rk,uJk)} k is the discrete space including 
all the points of discontinuity observed so far 



and Jordan (2007) showed that the IBP (Griffiths and 



Ghahramani 



Thibaux 



2005 1 can be generated from the beta- 



Bernoulli process by marginalizing out the draw from 
the beta process. In the IBP metaphor, customer n+1 
walks through a buffet line, trying a dish tasted by 
previous customers with probability proportional to 
the popularity of that dish among previous customers; 
additionally, this customer tries K n+ i ~ Pois(^q^) 
new dishes. By placing a count distribution on each 
observation, a BNB process prior naturally leads to 
a "multiple-scoop" generalization of the original IBP 
(i.e., a msIBP). That is, each customer now takes a 
number of scoops of each selected dish while walking 
through the buffet line. 

Unlike the IBP, however, the msIBP predictive pos- 
terior distribution of X n+ i\{Xi}l l =1 does not have a 
closed form solution for a general measure Rq. This 
occurs because Rq is not conjugate to the negative bi- 
nomial distribution. We can analytically compute the 
distribution of the number of new dishes sampled. To 
find the number of new dishes sampled, note that an 
atom {(pk,rk,LOk)} produces a dish that is sampled 
with probability 1 — (1 — Pk) Tk ■ Using the Poisson pro- 
cess decomposition theorem, we see that the number 
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of sampled dishes has distribution Pois(i/|" IBp , ■,), 



msIBP(r 



c(l - (l-p) r )p- 1 (l-p) c+nr - 1 d P R (dr). 

'0 ^0 

When Rq = Si, that is r = 1 and 7 = 1, then 
the number of new dishes sampled has distribution 



Pois 



is (a 



c+n 



Note that for choices of Rq where 



^msiBP(o) is nnite > ^msiBPW ^ as tw oo, and hence 
the number of new dishes goes to 0. Note that for the 
special case i?o = Si, the negative binomial process 
becomes the geometric process, and JlOl) reduces to 



the posterior form discussed in Thibaux ( 2008 ) ; this 



ad pi 

libaux 



simplification is not considered in the experiments, as 
it is often overly restrictive. 

3.3 Finite Approximations for Beta Process 

Since = v BP ([0 7 1] X IR + x Q) = oo, a BP generates 
countably infinite random points. For efficient compu- 
tation, it is desirable to construct a finite Levy mea- 
sure which retains the random points of BP with non- 
negligible weights and whose infinite limit converges 
to u BP . We propose such a finite Levy measure as 

v* BP (dpdrduj) = qp ce - 1 (l-p) , < 1 - e )- 1 dp J Rb(dr)Bo(dw) 

(12) 

where e > is a small constant, and we have 

j^+p = jy £ * BP ([0,l]xIR + xf7) = cryaB(ce,c(l-e)) (13) 

where B(ce,c(l — e)) is the beta function and it ap- 
proaches infinity as e — > 0. Since 



hm ^(dpdrdc) = ^ 



>o v Bp {dpdrduj) 



P 



o V 1 - p 



= 1 



we can conclude that v>* BP (dpdrduj) approaches 
v BP (dpdrduj) as e — > 0. 

Using v* BP {dpdrduj) as the finite approximation, a 
draw from eBP B* ~ eBP(c, Bq) can be expressed as 



K 



B: = \ Pk s ( 



fc=i 



K ~ Pois(^+p) 



where {(Pk,r k) u k )}k=i ir{dpdrduj) = 

i/* BP (dpdrdio) I v^gp and we have ir(dp) = 
Beta(p; ce, c(l — e))dp, ir(dr) = Ro(dr)/j and 
ir(duj) = Bo(du>)/a. A reversible jump MCMC algo- 
rithm can be used to sample the varying dimensional 
parameter K. We can also simply choose a small 
enough e and set K = E[Pois(i/^ BP )] = v+ Bp . 

A finite approximation by restricting p £ [e, 1] 



(Wolpert et al. 2011) and stick breaking representa- 



tions (Paisley et al. 2010 Teh et al. 2007) may also 



be employed to approximate the infinite model. 



4 Poisson Factor Analysis 

Given K < oo and a count matrix X £ M. PxN with P 
terms and N samples, discrete latent variable models 
assume that the entries of X can be explained as a sum 
of smaller counts, each produced by a hidden factor, 
or in the case of topic modeling, a hidden topic. We 
can factorize X under the Poisson likelihood as 

X = Pois(*0) (14) 

where £ M. PxK is the factor loading matrix, each 
column of which is a factor encoding the relative im- 
portance of each term; £ M. KxN is the factor score 
matrix, each column of which encodes the relative im- 
portance of each atom in a sample. This is called Pois- 
son factor analysis (PFA). 



We can augment (14) as 



K 

E 



Poisf 



(15) 



which is also used in Dunson and Herring ( 2005 ) for a 



discrete latent variable model. This form is useful for 
inferring <j) p k and 6 k i- As proved in Lemma 4.1 (below), 



we can have another equivalent augmentation as 

4>pk&ki 



1 L 



Pois 




Cpik 



, x piK \ ~ Mult (x pi ; (pn, • • • , Cpiif) (16) 
which assigns x P i into the K latent factors. Both aug- 



mentations in (15) and (16) are critical to derive effi- 



cient inferences, which were not fully exploited by re- 



lated algorithms (Buntine and Jakulin 2006 |Canny 



2004 Lee and Seung 2000 Williamson et al. |2010) 



Lemma 4.1. Suppose that x\,...,Xjc are indepen- 
dent random variables with x k ~ Pois(Afe) and x = 

SfeLi x k- Set A = J2k=i ^k; let (y,yi, ■ ■ .,Vk) be ran- 
dom variables such that 

Ai \k 



y~Pois(A), (j/i,...,y fc )|j/~Multh/ 



A 



A 



Then the distribution of x = (x, xi, . . . , xk ) is the 
same as the distribution of y — (y,yi, ■ ■ ■ , yx)- 

Proof. For t = [to, - ■ ■ ,tjc] £ M. K+1 and compare the 
characteristic functions (CF) of x and y, 

K 



E 



E 



H4 



it(t +t k )x k 



= P ELi^e , < i »+'t»- 1 . 



k=l 



Jt 1 y 



--E 



E 



Jt'v 



--E 



t K 

E 

\k=l 



K A e i(k>+tk) 



Since the CF uniquely characterizes the distribution, 
the distributions are the same. □ 
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4.1 Beta-Gamma-Gamma-Poisson Model 

Notice that for a sample, the total counts are usually 
observed, and it is the counts assigned to each factor 
that are often latent and need to be inferred. However, 



the BNB process in Section 3.1 does not tell us how the 



total counts are assigned to the latent factors. Recall- 
ing ([lj , the negative binomial distribution is a gamma- 
Poisson mixture distribution. Therefore, we can equiv- 
alently represent X l ~ NBP(£*) as X l ~ V(T(B*)), 
where T{B*) is a gamma process defined on B* and 
V(T{B*)) is a Poisson process defined on T{B*). Thus 
at atom u>k, the associated count Kki ~ NB(rfc,pfc) can 
be expressed as 

K-ki ~ Pois(6» fc4 ), 9 ki ~ Gamma(r fe ,p fe /(1 - p k )) (17) 

where Qki is the weight of at atom w k in the gamma 
process T(B*). If we further place a gamma prior 
on r k , then (pk,rk,9ki, Kki) naturally forms a beta- 
gamma-gamma-Poisson hierarchical structure, and we 
call the BNB process PFA formulated under this struc- 
ture the ^r-PFA. This kind of hierarchical structure 
is useful for sharing statistical strength between differ- 
ent groups of data, and efficient inference is obtained 
by exploiting conjugacies in both the beta-negative bi- 
nomial and gamma-Poisson constructions. 

Using eBP v* BP (dpdrdc/>) on [0, 1] x R+ x R p as the 
base measure for the negative binomial process, we 
construct the eBNB process and apply it as the non- 
parametric Bayesian prior for PFA. We have K ~ 
Pois^^p) = Pois(c7aT3(ce, c(l — e))), and thus K < 
oo with the equivalence obtained at e = 0. Using the 
beta-gamma-gamma-Poisson construction, we have 





K 

= S.Xpik, Xpik ~ Pois{(f)pkOk t ) 

k=l 


(18) 




~ Dir(a0, • • • ,<ty) 


(19) 




~ Gamma ( r k , — — — ) 
V 1 - Pk J 


(20) 


rk 


~ Gamma(coro, 1/co) 


(21) 


Pk 


~ Beta(ce, c(l — e)). 


(22) 



4.2 MCMC Inference 

Denote x.ik 



t^ik=l x pik- - 

c^aQ(ce, c(l — e)). 



X/p=i x pik and x.{. — 2p=i Sfe=i x pik- The size 

of K is upper bounded by 



Sampling x P ik- Use (16) 



Sampling 4> k . Exploiting the relationships be- 
tween the Poisson and multinomial distributions 
and using ^2p = i4> P k = 1, one can show that 
p([xuk, ■ ■ ■ ,xpik]\-) ~ Mult [x. ik \ fc ), thus we have 



Sampling p k . Marginalizing <fi k and 6ki out, x.ik ~ 
NB(r k ,Pk), Pk ~ Beta(ce, c(l - e)), thus 

p{pk\~) ~ Beta(ce + x.. k ,c(l -e)+Nr k ). (24) 

Sampling r k . It can be shown that p(r k \— ) oc 
Gamma(r fe ; c r , l/c ) J] i= i NB (af. jfc ; r fc ,p & ), thus 



pOfeH ~ Gamma ( c r , - ) (25) 

co - ATlog(l -pfe)/ 



if a;..fc = 0. If 7^ 0, we prove in Lemma |4.2| that 
g( r k) — logp(rfc|— ) is strictly concave if co?*o > 1, then 
we can use Newton's method to find an estimate as 



Tk 



P'{rl\-) 

p"K\-) 



which can be used to construct a proposal in a 
Metropolis-Hastings (MH) algorithm as 



r'k~N(r k ,nV?~k) (26) 
where /j, is an adjustable stepsize. Note that the adap- 



tive rejection sampling in Gilks and Wild (1992) may 
also be used to sample r k . 



Sampling 6 k i. Using (18) and (20), we have 

p(0ki\-) ~ Gamma(r fc + x. ik ,p k ). (27) 

Lemma 4.2. If x„ k ^ 0, then for any coro > 1, 
g( r k) = logp(?'fe|— ) is strictly concave. 

Proof. Since g'(r k ) = (c r - l)/r k ~ c + AT log (1 - 
Pk)-Nip(r k )+J2iLi ^( r k+x-ik) andg"(r fc ) = -(c r - 
l) /rl-Ntpi{r k )+Y^ =1 tpi(r k +x. ik ), where ip(x) is the 
diagmma function and ipi(x) is the trigamma function 
which is strictly decreasing for x > 0, if co^o > 1 and 
x.. k ^ 0, we have g"(r k ) < 0, and thus g(r k ) is strictly 
concave and has a unique maximum. □ 

5 Related Discrete Latent Variable 
Models 

The hierarchical form of the ft^T-PFA model shown in 
( 18 l-(p2| can be modified in various ways to connect to 



previous discrete latent variable models. For example, 
we can let {0 ki } t =i. N = g k and g k ~ Gamma(go/AT, 1), 
resulting in the infinite gamma-Poisson feature model 
Titsias| p008| as K 



oo. 



Thibauxl (|2008b showed 



that it can also be derived from a gamma-Poisson 
process. Although this is a nonparametric model 
supporting an infinite number of features, requiring 
{9ki}i=i,N = gk may be too restrictive. We mention 



that Broderick et al. (2012) have independently inves- 



P(<Pk 



Dir ( 



xi-k, ■ ■ ■ 



+ x P .k). (23) 



tigated beta-negative binomial processes for mixture 
and admixture models. 
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Table 1: Algorithms related to PFA under various prior 
settings. The gamma scale and shape parameters and the 
sparsity of the factor scores are controlled by pk, rt and 
Zki, respectively. 



PFA 


Infer 


Infer 


Infer 


Infer 


Related 


priors 


Pk 


Tk 




K 


algorithms 


r 


X 


X 


X 


X 


NMF 


Dir 


X 


X 


X 


X 


LDA 


pr 


/ 


X 


X 


/ 


GaP 


Sir 


X 


/ 


/ 


/ 


FTM 


/3 7 r 


/ 


/ 


X 


/ 





and Seung 2000). If we set — and a<f, = 1, then 
( 32 1 and p3| ) are the same as those of the gamma- 
Poisson (GaP) model of Canny (2004), in which set- 
ting ag — 1.1 and estimating g k with g k = E[0fej] are 
suggested. Therefore, as summarized in Table[TJ NMF 
is a special cases of the T-PFA, which itself can be con- 
sidered as a special case of the f3^T-PYA with fixed r k 
and p k . If we impose a Dirichlet prior on <p k , the GaP 
can be consider as a special case of /3F-PFA, which 
itself is a special case of the /37F-PFA with a fixed r k . 



Before examining the details, defined by pk and r k in 
p0|, z k i in (1341 and the subset of the K factors needed 



5.2 Latent Dirichlet Allocation 



to represent the data are inferred, we summarize in Ta- 
ble [T] the connections between related algorithms and 
PFA with various priors, including non-negative ma- 
trix factorization (NMF) (|Lee and Seung||2000|, latent 



Dirichlet allocation (LD A) (|Blei et ah 2003[ ), gamma- 
Poisson (GaP) ( Canny| 2004) and the focused topic 
model (FTM) ( |Williamson et al.[|2010"l ). Note that the 
gamma scale parameters p k j (1 — Pk) and Pk' /(I — Pk') 
in (20) are generally different for k ^ k' in /3 7 r-PFA, 
and thus the normalized factor score Oi = Oi /Efc=i 9 ki 
does not follow a Dirichlet distribution. This is a 
characteristic that distinguishes /^r-PFA from mod- 
els with a global gamma scale parameter, where the 
normalized factor scores follow Dirichlet distributions. 

5.1 Nonnegative Matrix Factorization and a 
Gamma-Poisson Factor Model 

We can modify the /^r-PFA into a T-PFA by letting 



4> p k ~ Gamma(a0, 1/6^) 
9 kl ~ Gamma(a e ,# fc /a e ). 



(28) 
(29) 



Using fllSh , (28) and (29), one can show that 



p{4>pk\-) ~ Gamma(a^ + x p . kl 1/(60 + k .)) (30) 
p(0ki\~) ~ Gamma(a e + x. ik , l/{a e /g k + <j>. k )) (31) 



where 9 k . = J2iLi ^ki and 



and ag > 1, using (16), (30) and (31), we can substi- 



J2v=l ^pk- If »0 > 1 



tute E[a; p ife] into the modes of 4> pk and 9 k i, leading to 
an Expectation-Maximization (EM) algorithm as 



9pfc 



>ki 



Opfc" 



a.4, — 1 
<t> P k 



E 



-- 1 EJU^e* 



'ki~ 



ag/gk + (p-k 



(32) 
(33) 



We can 
let PFA 

Oi to o t 

derivation 

p([x-a,'- 
P(0i\-) 



modify the f3^T-PFA into a Dirich- 
(Dir-PFA) by changing the prior of 
~ Dir(ae,--- , a$). Similar to the 
of p(4> k \—) in (23), one can show 
,x-ik]\ — ) ~ Mult (x.i.; Oi) and thus 
Dir (ag + x.n, ■ ■ ■ ,ag + x.^k)- Dir-PFA and 



LDA ( |Blei et al.| |2003| |Hoffman et~aTj |2010[ ) have the 



same block Gibbs sampling and variational Bayes in- 
ference equations (not shown here for brevity). It may 
appear that Dir-PFA should differ from LDA via the 
Poisson distribution; however, imposing Dirichlet pri- 
ors on both factor loadings and scores makes it essen- 
tially lose that distinction. 

5.3 Focused Topic Model 

We can construct a sparse 7F-PFA (S7F-PFA) with 
the beta-Bernoulli process prior by letting 9 k i = z k iS k i 
and 

s kl ~ Gamrna(r fe ,p fc /(1 ~Pk)), r k ~ Gamma(r , 1) 
z k i ~ Bernoulli^), ir k ~ Beta(ce, c(l — e)). (34) 

If we fix pk = 0.5, it can be shown that under the PFA 
framework, conditioning on Zki, we have 



•E-ik 



NB( 2fci r fe ,0.5) 
NB I > > w r fc ,0.5 




(35) 
(36) 

, z Kl r K ) (37) 
• , CpiK ) (38) 



where ( plk = (<f> P k6ki)/Y,k=i <t>pk6ki- Therefore, S 7 r- 
PFA has almost the same MCMC inference as the fo- 
cused topic model (FTM) using the IBP compound 
Dirichlet priors (Williamson et al. 2010). Note that 



If we set 60 = 0, ct0 = ag = 1 and gk = 00, then (32) (35) and (36) are actually used in the FTM to infer 



and ( 33 ) are the same as those of non-negative ma- 
trix factorization (NMF) with an objective function 
of minimizing the KL divergence Dkx(X||<I>0) (Lee 



rk and z k i (Williamson et al. 2010) without giving 



explicit explanations under the multinomial-Dirichlct 
construction, however, we show that both equations 
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naturally arise under S7F-PFA under the constraint 
that pk = 0.5. In this sense, S7F-PFA provides justi- 



<a)r k 



(b)p k 



fications for the inference in Williamson et al. (2010). 



6 Example Results and Discussions 

We consider the JAQVfj] and PsyRe\[^] datasets, re- 
stricting the vocabulary to terms that occur in five 
or more documents in each corpus. The JACM in- 
cludes 536 abstracts of the Journal of the ACM from 
1987 to 2004, with 1,539 unique terms and 68,055 total 
word counts; the PsyRev includes 1281 abstracts from 
Psychological Review from 1967 to 2003, with 2,566 
unique terms and 71,279 total word counts. As a com- 
parison, the stopwords are kept in JACM and removed 
in PsyRev. We obtain similar results on other docu- 
ment corpora such as the NIPS corpufj^J We focus on 
these two datasets for detailed comparison. 

For each corpus, we randomly select 80% of the words 
from each document to form a training matrix T, hold- 
ing out the remaining 20% to form a testing matrix 
Y = X T. We factorize T as T - Pois(*0) and 
calculate the held-out per-word perplexity as 



exp 



1 

y- 



P N 



2^ 2_, y pi log 



pk ki 



p—1 i—1 



J2s=l J2p=l J2k=l ^pk^ki , 



where S is the total number of collected samples, y.. = 

J2iLi Vpi anci V P i = Y (P' *)• Tne nnal results are 
based on the average of five random training/testing 
partitions. We consider 2500 MCMC iterations, with 
the first 1000 samples discarded and every sample per 
five iterations collected afterwards. The performance 
measure is similar to those used in I Asuncion et al.l 
( [20091 ) i IWaflach et al. | ( [2009] ) ; | Wang et~al~l pOlT] ) . 



As discussed in S ec. [5] and shown in Table [TJ NMF 

is a special case of T-PFA; 
special case of /?r-PFA: and 



(Lee and Seung 


2000) 


GaP ( 


Canny 


2004 1 is a 



in terms of inference, Dir-PFA is equivalent to LDA 



( Blei et al. 


2003 


Hoffman et al. 


2010 


1; and SiF-PFA 


is closely related to FTM (Williamson et al. 




2010). 



Therefore, we are able to compare all these algorithms 
with ftjT-PFA under the same PFA framework, all 
with MCMC inference. 



We set the priors of T-PFA as 



no 



1.01, 



= 10~ 6 , g k = 10 6 and the prior of /3T-PFA as 
rj; = 1.1; these settings closely follow those that lead 
to the EM algorithms of NMF and GaP, respectively. 
We find that T- and /3r-PFAs in these forms gener- 
ally yield better prediction performance than their EM 



x http:/ /www. cs.princeton.edu/~blei/downloads/ 
2 psiexp. ss.uci.edu/research/programs_data/toolbox. htm 
3 http:/ /cs. nyu.edu/~roweis/data. html 
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Figure 1: Inferred r k , p k , mean r k p k /(l - p k ) and 
variance-to-mean ratio 1/(1 — Pk) for each factor by PyT- 
PFA with = 0.05. The factors are shown in decreasing 
order based on the total number of word counts assigned 
to them. Of the ^ max = 400 possible factors, there are 209 
active factors assigned nonzero counts. The results are ob- 
tained on the training count matrix of the PsyRev corpus 
based on the last MCMC iteration. 



counterparts, thus we report the results of T- and f3T- 
PFAs under these prior settings. We set the prior of 
Dir-PFA as ag = 50/ K, following the suggestion of 
the topic model toolbox 2 provided for Griffiths and 



Steyvers (2004). The parameters of (3jT-PFA are 
To = 7 = a = 1. In this case, 



set as c 
+ 

BP 



co = 
7r/ sin(7re) 



e for a small e, thus we pre- 



set a large upper-bound K max and let e = l/-ftf ma x- 
The parameters of S7F-PFA are set as c = r = 1 
and e = l/i4r max . The stepsize in (26) is initialized as 
/i = 0.01 and is adaptively adjusted to maintain an 
acceptance rate between 25% and 50%. For all the 
algorithms, $ and are preset with random values. 
Under the above settings, it costs about 1.5 seconds 
per iteration for /3jT-PFA on the PsyRev corpus us- 
ing a 2.67 GHz PC. 

Figure [l] shows the inferred and p k , and the inferred 
mean rkPk/ '(1—Pk) and variance-to-mean ratio (VMR) 
1/(1 — Pk) for each latent factor using the /^r-PFA 
algorithm on the PsyRev corpus. Of the K = 400 
possible factors, there are 209 active factors assigned 
nonzero counts. There is a sharp transition between 
the active and nonactive factors for the values of rk and 
Pk- The reason is that for nonactive factors, pk and 



are drawn from (24) and (25), respectively, and thus 



Pk with mean ce/(c+Nrk) is close to zero and rk is ap- 
proximately drawn from its prior Gamma(cof'o, 1/co); 
for active factors, the model adjusts the negative bino- 
mial distribution with both pk and to fit the data, 
and rk would be close to zero and pk would be close to 
one for an active factor with a small mean and a large 
VMR, as is often the case for both corpora considered. 

We find that the first few dominant factors corre- 
spond to common topics popular both across and in- 
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Figure 2: Per-word perplexities on the test count matrix 
for (a) JACM and (b) PsyRev with = 0.05. The re- 
sults of P- and Dir-PFAs are a function of K. j3Y-, S^Y- 
and /?7P-PFAs all automatically infer the number of active 
factors, with the number at the the last MCMC iteration 
shown on top of the corresponding lines. 



side documents. For example, the first two most dom- 
inant topics are characterized by "proposed, evidence, 
data, discussed, experiments" and "model, models, ef- 
fects, response, predictions" in PysRev where the stop 
words are removed; the first two dominant topics in 
JACM, where the stop words are kept, are character- 
ized by "the, of, a, is, in" and "we, of, in, and, the". 
These top factors generally have large means and small 
VMRs. The remaining topics have a diverse range 
of means and VMRs. A PsyRev factor with promi- 
nent words "masking, visual, stimulus, metacontrast" 
and a JACM factor with prominent words "local, con- 
sistency, finite, constraint" are example topics with 
large mean and large VMR. A PsyRev factor "rivalry, 
binocular, monocular, existence" and a JACM factor 
"search, binary, tree, nodes" are example topics with 
small mean and large VMR. Therefore, the fi"{T-P¥A 
captures topics with distinct characteristics by adjust- 
ing the negative binomial parameters r/. and Pk, and 
the characteristics of these inferred parameters may 
assist in factor/topic interpretation. Note that con- 
ventional topic models are susceptible to stop words, 
in that they may produce topics that are not read- 



ily interpretable if stop words are not removed (Blei 



et al. 2010). Our results show that when stop words 
are present, /3"fT-PFA usually absorbs them into a few 
dominant topics with large mean and small VMR and 
the remaining topics are easily interpretable. 

Figure [2] shows the performance comparison for both 
corpora with the factor loading (topic) Dirichlet prior 
set as <20 = 0.05. In both T- and Dir-PFAs, the number 
of factors K is a tuning parameter, and as K increases, 
r-PFA quickly overfits the training data and Dir-PFA 
shows signs of ovefitting around K = 100. In /3T-, 
SjT- and /^r-PFAs, the number of factors is upper 
bounded by if max = 400 and an appropriate K is au- 
tomatically inferred. As shown in Fig. [2] f3jT-PFA 
produces the smallest held out perplexity, followed by 
S7F- and /?r-PFAs, and their inferred sizes of K at the 
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Figure 3: Per-word perplexities on the test count 
matrix of (a) JACM and (b) PsyRev as a function 
of the factor loading (topic) Dirichlet prior g 
{0.01,0.05,0.1,0.25,0.5}. The results of Y- and Dir-PFAs 
are shown with the best settings of K under each a$. The 
number of active factors under each are all automati- 
cally inferred by /3Y-, SyY- and /^P-PFAs. The number 
of active factors inferred by /37DPFA at the last MCMC 
interaction are shown under the corresponding points. 

last MCMC iteration are 132, 118 and 29 for JACM 
and 209, 163 and 30 for PsyRev, respectively. 

Figure [3] shows the performance of these algorithms as 
a function of a^. For T-PEA, is fixed. For Dir-, (3T-, 
S-fT- and (3jT-PFAs, influences the inferred sizes of 
K and the accuracies of held-out predictions. We find 
that a smaller generally supports a larger K, with 
better held-out prediction. However, if is too small 
it leads to overly specialized factor loadings (topics), 
that concentrate only on few terms. As shown in [3] 
/37F-PFA yields the best results under each and it 
automatically infers the sizes of K as a function of a^. 

7 Conclusions 

A beta-negative binomial (BNB) process, which leads 
to a beta-gamma-Poisson process, is proposed for mod- 
eling multivariate count data. The BNB process is 
augmented into a beta-gamma-gamma-Poisson hier- 
archical structure and applied as a nonparametric 
Bayesian prior for Poisson factor analysis (PFA), an 
infinite discrete latent variable model. A finite ap- 
proximation to the beta process Levy random mea- 
sure is proposed for convenient implementation. Effi- 
cient MCMC inference is performed by exploiting the 
relationships between the beta, gamma, Poisson, nega- 
tive binomial, multinomial and Dirichlet distributions. 
Connections to previous models are revealed with de- 
tailed analysis. Model properties are discussed, and 
example results are presented on document count ma- 
trix factorization. Results demonstrate that by mod- 
eling latent factors with negative binomial distribu- 
tions whose mean and variance are both learned, the 
proposed ^r-PFA is well suited for topic modeling, 
defined quantitatively via perplexity calculations and 
more subjectively by capturing both common and spe- 
cific aspects of a document corpus. 
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